<!DOCTYPE html>
<html lang="en">

<!-- Head tag -->
<head><meta name="generator" content="Hexo 3.9.0">

    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <!--Description-->
    
        <meta name="description" content="通用爬虫
通用网络爬虫 从互联网中搜集网页，采集信息，这些网页信息用于为搜索引擎建立索引从而提供支持，它决定着整个引擎系统的内容是否丰富，信息是否即时，因此其性能的优劣直接影响着搜索引擎的效果。
不扯没用的，上干货！
创建项目：
　　cmd 命令： scrapy startproject 项目名 
">
    

    <!--Author-->
    
        <meta name="author" content="ck">
    

    <!--Open Graph Title-->
    
        <meta property="og:title" content="CrawlSpider---&gt;流程">
    

    <!--Open Graph Description-->
    

    <!--Open Graph Site Name-->
    <meta property="og:site_name" content="CK&#39;s blogs">

    <!--Type page-->
    
        <meta property="og:type" content="article">
    

    <!--Page Cover-->
    

    <meta name="twitter:card" content="summary">
    

    <!-- Title -->
    
    <title>CrawlSpider---&gt;流程 - CK&#39;s blogs</title>

    <!-- Bootstrap Core CSS -->
    <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/bootstrap/4.0.0-alpha.2/css/bootstrap.min.css" integrity="sha384-y3tfxAZXuh4HwSYylfB+J125MxIs6mR5FOHamPBG064zB+AFeWH94NdvaCBm8qnd" crossorigin="anonymous">

    <!-- Custom Fonts -->
    <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" rel="stylesheet" type="text/css">

    <!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
    <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
    <!--[if lt IE 9]>
        <script src="//oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
        <script src="//oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
    <![endif]-->

    <!-- Gallery -->
    <link href="//cdnjs.cloudflare.com/ajax/libs/featherlight/1.3.5/featherlight.min.css" type="text/css" rel="stylesheet">

    <!-- Custom CSS -->
    <link rel="stylesheet" href="/blog/css/style.css">

    <!-- Google Analytics -->
    


</head>


<body>

<div class="bg-gradient"></div>
<div class="bg-pattern"></div>

<!-- Menu -->
<!--Menu Links and Overlay-->
<div class="menu-bg">
    <div class="menu-container">
        <ul>
            
            <li class="menu-item">
                <a href="/blog/">
                    Home
                </a>
            </li>
            
            <li class="menu-item">
                <a href="/blog/archives">
                    Archives
                </a>
            </li>
            
            <li class="menu-item">
                <a href="/blog/2019/09/04/本人简历/">
                    About
                </a>
            </li>
            
            <li class="menu-item">
                <a href="/blog/tags">
                    Tags
                </a>
            </li>
            
            <li class="menu-item">
                <a href="/blog/categories">
                    Categories
                </a>
            </li>
            
            <li class="menu-item">
                <a href="/blog/contact.html">
                    Contact
                </a>
            </li>
            
        </ul>
    </div>
</div>

<!--Hamburger Icon-->
<nav>
    <a href="#menu"></a>
</nav>

<div class="container">

    <!-- Main Content -->
    <div class="row">
    <div class="col-sm-12">

        <!--Title and Logo-->
        <header>
    <div class="logo">
        <a href="/blog/"><i class="logo-icon fa fa-cube" aria-hidden="true"></i></a>
        
    </div>
</header>

        <section class="main">
            
<div class="post">

    <div class="post-header">
        <h1 class="title">
            <a href="/blog/2018/07/19/CrawlSpider/">
                CrawlSpider--->流程
            </a>
        </h1>
        <div class="post-info">
            
                <span class="date">2018-07-19</span>
            
            
            
        </div>
    </div>

    <div class="content">

        <!-- Gallery -->
        

        <!-- Post Content -->
        <p>通用爬虫</p>
<p>通用网络爬虫 从互联网中搜集网页，采集信息，这些网页信息用于为搜索引擎建立索引从而提供支持，它决定着整个引擎系统的内容是否丰富，信息是否即时，因此其性能的优劣直接影响着搜索引擎的效果。</p>
<p>不扯没用的，上干货！</p>
<p>创建项目：</p>
<p>　　cmd 命令： scrapy startproject 项目名 </p>
<p>创建</p>
<p>　　cmd 命令：scrapy genspider -t crawl 爬虫名 允许爬取得域名</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">在spider文件 主爬虫文件.py中 替换 start_urls 为要爬取的网址！</span><br><span class="line"> </span><br><span class="line">在rules中进行指定规则：</span><br><span class="line">　　ps:规则制定时选中的必须是标签，或正则匹配连接地址（可跳转性）</span><br></pre></td></tr></table></figure>

<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">rules = (</span><br><span class="line">    Rule(LinkExtractor(allow=r&apos;/sort/?st=product&amp;sort=A\[A-Z]&apos;), follow=True),   # 参数： allow 正则选择要满足的需求特点  deny ：一定不取得特点</span><br><span class="line">    Rule(LinkExtractor(restrict_xpaths=&quot;//a[text()=&apos;»&apos;]&quot;), follow=True),　　　　# 参数 restrict_xpaths 用xpath 反向选择或手写xpath取要满足的需求特点</span><br><span class="line">    Rule(LinkExtractor(restrict_xpaths=&quot;/html/body/div[4]/div/div[2]/div[4]/div/div[1]/div/a&quot;), callback=&apos;parse_data&apos;,follow=False),  # 拷贝xpath选择要满足的需求特点</span><br><span class="line">)</span><br></pre></td></tr></table></figure>

<p>　ps：　　　</p>
<figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line">　　LinkExtractor（）   链接提取器  对选中连接或标签进行操作</span><br><span class="line">　　follow=True  同位置 继续 提取需求标签</span><br><span class="line">　　follow=False  停止 或 不在 同位置继续提取需求标签</span><br><span class="line">　　clllback=&apos;一个函数名&apos;  follow=True 时 没有太大必要使用，除非该页有需求值 follow=False 时 说明你到了目标数据位置 这是把请求的响应抛给了该函数</span><br><span class="line"></span><br><span class="line">而到了函数，就可以利用xpath取值了，封装进item里就可以了，这里item不需要你实例化了item= &#123;&#125;就可以了 通用爬虫帮你做了</span><br><span class="line"></span><br><span class="line">最后yield item</span><br><span class="line"></span><br><span class="line">主爬虫就写完了。</span><br><span class="line"></span><br><span class="line">然后 --》 ？？？</span><br><span class="line">然后的是就和普通scrapy一样了，</span><br><span class="line">settings.py 里配置 在之前写的博文里有，自行去查看</span><br><span class="line">item.py 里 设置需求数据字段 </span><br><span class="line">pipelines.py 对数据队列 进行操作</span><br><span class="line"></span><br><span class="line">Ps:</span><br><span class="line">了解不深，个人看法：</span><br><span class="line">其实理解这个通用爬虫不难：</span><br><span class="line">就一句话：</span><br><span class="line"> 　　需要注意的点就在rules这块对吧，我的理解就是，这是在宏观意义上给你要提取输得位置进行制定规则，可以理解为定位，所有满足特点的位置就是我需求数据存放的位置！</span><br><span class="line">好了 就到这里！ 简单吧 ！</span><br><span class="line">　　</span><br><span class="line">　　其他的？ 自己悟把！哈哈哈！</span><br></pre></td></tr></table></figure>
    </div>

    

    
        <div class="post-tags">
            <i class="fa fa-tags" aria-hidden="true"></i>
            <a href="/blog/tags/通用爬虫/">#通用爬虫</a> <a href="/blog/tags/爬虫/">#爬虫</a>
        </div>
    

    <!-- Comments -->
    

</div>
        </section>

    </div>
</div>


</div>

<!-- Footer -->
<div class="push"></div>

<footer class="footer-content">
    <div class="container">
        <div class="row">
            <div class="col-xs-12 col-sm-12 col-md-6 col-lg-6 footer-about">
                <h2>About</h2>
                <p>
                    This theme was developed by <a href="https://github.com/klugjo">Jonathan Klughertz</a>. The source code is available on Github. Create Websites. Make Magic.
                </p>
            </div>
            
    <div class="col-xs-6 col-sm-6 col-md-3 col-lg-3 recent-posts">
        <h2>Recent Posts</h2>
        <ul>
            
            <li>
                <a class="footer-post" href="/blog/2019/10/25/tars框架环境基础搭建/">tars框架环境基础搭建</a>
            </li>
            
            <li>
                <a class="footer-post" href="/blog/2019/10/20/认识tars框架/">认识tars框架</a>
            </li>
            
            <li>
                <a class="footer-post" href="/blog/2019/10/05/redis数据迁移备份与恢复/">redis数据迁移备份与恢复</a>
            </li>
            
            <li>
                <a class="footer-post" href="/blog/2019/09/21/负载均衡的5种策略/">负载均衡的5种策略</a>
            </li>
            
        </ul>
    </div>



            
        </div>
        <div class="row">
            <div class="col-xs-12 col-sm-12 col-md-12 col-lg-12">
                <ul class="list-inline footer-social-icons">
                    
                    <li class="list-inline-item">
                        <a href="https://github.com/klugjo/hexo-theme-alpha-dust">
                            <span class="footer-icon-container">
                                <i class="fa fa-github"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="https://twitter.com/?lang=en">
                            <span class="footer-icon-container">
                                <i class="fa fa-twitter"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="https://www.facebook.com/">
                            <span class="footer-icon-container">
                                <i class="fa fa-facebook"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="https://www.instagram.com/">
                            <span class="footer-icon-container">
                                <i class="fa fa-instagram"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="https://dribbble.com/">
                            <span class="footer-icon-container">
                                <i class="fa fa-dribbble"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="https://plus.google.com/">
                            <span class="footer-icon-container">
                                <i class="fa fa-google-plus"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="https://www.behance.net/">
                            <span class="footer-icon-container">
                                <i class="fa fa-behance"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="https://500px.com/">
                            <span class="footer-icon-container">
                                <i class="fa fa-500px"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="mailto:test@example.com">
                            <span class="footer-icon-container">
                                <i class="fa fa-envelope-o"></i>
                            </span>
                        </a>
                    </li>
                    
                    
                    <li class="list-inline-item">
                        <a href="\#">
                            <span class="footer-icon-container">
                                <i class="fa fa-rss"></i>
                            </span>
                        </a>
                    </li>
                    
                </ul>
            </div>
        </div>
        <div class="row">
            <div class="col-xs-12 col-sm-12 col-md-12 col-lg-12">
                <div class="footer-copyright">
                    @Untitled. All right reserved | Design & Hexo <a href="http://www.codeblocq.com/">Jonathan Klughertz</a>
                </div>
            </div>
        </div>
    </div>
</footer>

<!-- After footer scripts -->

<!-- jQuery -->
<script src="//code.jquery.com/jquery-2.1.4.min.js"></script>

<!-- Tween Max -->
<script src="//cdnjs.cloudflare.com/ajax/libs/gsap/1.18.5/TweenMax.min.js"></script>

<!-- Gallery -->
<script src="//cdnjs.cloudflare.com/ajax/libs/featherlight/1.3.5/featherlight.min.js" type="text/javascript" charset="utf-8"></script>

<!-- Custom JavaScript -->
<script src="/blog/js/main.js"></script>

<!-- Disqus Comments -->



</body>

</html>